Data Placement and Replica Selection for Improving Co-location in Distributed Environments

Authors

K. Ashwin Kumar

Amol Deshpande

Samir Khuller

Abstract

The increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases and systems that support alternative frameworks like MapReduce. There is thus increasing contention for scarce data center resources like network bandwidth (especially cross-rack bandwidth); further, the energy requirements for powering the computing equipment are also growing dramatically. As we show empirically, increasing the execution parallelism by spreading out data across a large number of machines may achieve the intended goal of decreasing query latencies but, in most cases, may significantly increase the total resource and energy consumption. For many analytical workloads, however, minimizing query latencies is often not critical; in such scenarios, we argue that we should instead focus on minimizing the average query span, i.e., the average number of machines that are involved in the processing of a query, through co-location of data items that are frequently accessed together. In this work, we exploit the fact that most distributed environments need to use replication for fault tolerance, and we devise workload-driven replica selection and placement algorithms that attempt to minimize the average query span. We model a historical query workload trace as a hypergraph over a set of data items (which could be relation partitions or file chunks), and formulate and analyze the problem of replica placement by drawing connections to several well-studied graph-theoretic concepts. We use these connections to develop a series of algorithms to decide which data items to replicate and where to place the replicas. We show the effectiveness of our proposed approach by building a trace-driven simulation framework and by presenting results on a collection of synthetic and real workloads. Our experiments show that careful data placement and replication can dramatically reduce the average query spans, resulting in significant reductions in resource consumption.
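To make the query-span metric concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' algorithm: the workload is a list of hyperedges (each a set of data item names), a placement maps a machine name to the set of items it stores, and the span of a query is taken to be the smallest number of machines that together hold all of its items, found by brute force. The names `query_span`, `average_query_span`, `spread`, and `colocated` are hypothetical.

```python
from itertools import combinations


def query_span(query_items, placement):
    """Smallest number of machines that together store every item of the query.

    Brute-force search over machine subsets; only suitable for tiny examples.
    """
    items = set(query_items)
    if not items:
        return 0
    # Only machines holding at least one requested item can contribute.
    machines = [m for m, stored in placement.items() if items & stored]
    for k in range(1, len(machines) + 1):
        for combo in combinations(machines, k):
            covered = set().union(*(placement[m] for m in combo))
            if items <= covered:
                return k
    raise ValueError("some query item is not stored on any machine")


def average_query_span(workload, placement):
    """Mean query span over a workload given as a list of hyperedges."""
    return sum(query_span(q, placement) for q in workload) / len(workload)


# Hypothetical workload: each hyperedge is the set of items one query reads.
workload = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"c", "d"}]

# One item per machine, no replication (assumed toy placement).
spread = {"m1": {"a"}, "m2": {"b"}, "m3": {"c"}, "m4": {"d"}}

# Items that are accessed together share a machine; "c" is replicated.
colocated = {"m1": {"a", "b", "c"}, "m2": {"c", "d"}}

print(average_query_span(workload, spread))
print(average_query_span(workload, colocated))
```

In this toy case, replicating item `c` lets every query be answered from a single machine, which is the effect that workload-driven replica placement aims for. Finding the minimum span is a set-cover problem, so a real system would need something cheaper than this exhaustive search.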

Similar resources

An Efficient Data Replication Strategy in Large-Scale Data Grid Environments Based on Availability and Popularity

Data grid technology, which uses the scale of the Internet to overcome storage limitations for huge amounts of data, has become one of the hot research topics. Recently, data replication strategies have been widely employed in distributed environments to copy frequently accessed data to suitable sites. The primary purposes are shortening the distance of file transmission and achieving files from ...

Improving Data Grids Performance by Using Modified Dynamic Hierarchical Replication Strategy

Abstract: A Data Grid connects a collection of geographically distributed computational and storage resources and enables users to share data and other resources. Data replication, a technique much discussed by Data Grid researchers in recent years, creates multiple copies of a file and places them in various locations to shorten file access times. In this paper, a dynamic data replication strate...

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

Locating and Offering Optimal Price Distributed Generation Resources to Increase Profit Using Ant Lion Optimization Algorithm

Distribution of distributed generation resources in distribution systems has several advantages, including reducing losses, improving voltage profiles, reducing pollution, and increasing system reliability. However, one of the most important points regarding the placement of these resources in distribution networks is economic issues and the return on investment and the increase in profits from...

A GA-Based Replica Placement Mechanism for Data Grid

A Data Grid is an infrastructure that manages a huge amount of data files and provides intensive computational resources across geographically distributed collaborations. To increase resource availability and to ease resource sharing in such an environment, there is a need for replication services. Data replication is one of the methods used to improve the performance of data access in distributed sys...

Journal: CoRR

Volume: abs/1302.4168

Publication date: 2012
